#)
.R file
app.R file
.Rmd file: ``` for R chunks
.Rnw file: <<>>= and @ for R chunks
The pound sign (#) is used for comments in R. Below is some of the most common syntax for arithmetic in R.
## [1] 15
## [1] 94
## [1] 36
## [1] 5
## [1] 16
## [1] 3
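The inputs that produced the six results above are not shown. The sketch below uses operands that are assumptions, chosen only so that each line reproduces the corresponding result; the operators themselves are standard R.

```r
# Operands are illustrative assumptions; the operators are the point.
7 + 8      # addition
100 - 6    # subtraction
6 * 6      # multiplication
25 / 5     # division
2^4        # exponentiation
15 %% 4    # modulo (remainder after division)
```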
A basic tool in statistical programming is the variable. A variable allows users to store a value (e.g. 7) or an object (e.g. a function description). You can then use the name of the variable later on to easily access the value or the object that is stored within it. When creating a variable in R, use the <- or = assignment operator, with the variable name on the left and the variable value on the right.
## [1] 19
We can perform arithmetic on variables.
# Assign the value 19 to x
x <- 19
# Assign the value 7 to y
y <- 7
# Add the values of x and y and store the result in z
z <- x + y
# Print the value of z
z
## [1] 26
You can store a function as a variable. Here we create a function that adds two values. The name of the function is sumTwoValues. The function requires the user to input two values (two “input parameters”). The function then adds the two values and stores the result in a variable called sum. The value of sum is then returned to the user.
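The definition is not shown here, so the sketch below reconstructs it from the function body that is printed a little further down. Only the assignment to the name sumTwoValues is inferred.

```r
# Define a function with two input parameters, x and y
sumTwoValues <- function(x, y) {
  # 'sum' here is a local variable; base::sum() is unaffected outside the function
  sum <- x + y
  return(sum)
}
```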
Now, if we call the function name, it will simply list the code of the function.
## function(x, y) {
## sum <- x + y
## return(sum)
## }
To actually use the function, we must call the function name (sumTwoValues) and input the two required variables (x and y).
## [1] 3
## [1] 180
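The calls that produced these two results are not shown. A minimal sketch follows, with input values that are assumptions picked to give the same results.

```r
# Same definition as printed above
sumTwoValues <- function(x, y) {
  sum <- x + y
  return(sum)
}

# Input values are assumptions chosen to match the results shown
sumTwoValues(x = 1, y = 2)
sumTwoValues(x = 100, y = 80)
```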
Here is another example of creating a function in R, one that converts Fahrenheit to Celsius. This function requires one input parameter, temp_F, from the user.
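A possible definition is sketched below. The function name fahrenheit_to_celsius is taken from the session listing later in this tutorial. The body uses the standard conversion formula and the intermediate name temp_C is an assumption.

```r
# Convert a temperature from Fahrenheit to Celsius
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9   # temp_C is an assumed local name
  return(temp_C)
}
```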
We can call the function as follows:
## [1] -1.111111
Or as follows:
## [1] -1.111111
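The two calls are not shown. One plausible reading, sketched here as an assumption, is passing the value directly versus storing it in a variable first (myTempF appears in the session listing later on).

```r
# Compact definition so this sketch runs on its own
fahrenheit_to_celsius <- function(temp_F) (temp_F - 32) * 5 / 9

# Pass the value directly
fahrenheit_to_celsius(temp_F = 30)

# Store the value in a variable first, then pass the variable
myTempF <- 30
fahrenheit_to_celsius(temp_F = myTempF)
```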
Now, we can run the same function several times using different input parameter values of interest:
## [1] -1.111111
## [1] 4.444444
## [1] 10
## [1] 15.55556
## [1] 21.11111
We won’t go into details of loops in R. These can be found in many online tutorials. But for demonstrative purposes, the code above could be further reduced as follows:
## [1] -1.111111
## [1] 4.444444
## [1] 10
## [1] 15.55556
## [1] 21.11111
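The loop itself is not shown. A minimal sketch that reproduces the five results is below; the loop variable t matches a name in the session listing later on, and the rest is an assumption.

```r
# Compact definition so this sketch runs on its own
fahrenheit_to_celsius <- function(temp_F) (temp_F - 32) * 5 / 9

# Loop over 30, 40, ..., 70 and print each converted value
for (t in seq(from = 30, to = 70, by = 10)) {
  print(fahrenheit_to_celsius(t))
}
```

Because arithmetic in R is vectorised, `fahrenheit_to_celsius(seq(30, 70, 10))` returns the same five values as a single vector without a loop.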
How did this work? First, let’s look at the seq() function. This is a built-in R function. If we run the name of the function, we will only see the code of the function, which is not helpful.
## function (...)
## UseMethod("seq")
## <bytecode: 0x7fc38fcb4bd8>
## <environment: namespace:base>
Instead, let’s run a help() command on the seq() function. We can do this by running either of the two commands below:
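The two commands are most likely the function form and its question-mark shorthand, both of which are part of base R:

```r
help(seq)   # function-call form
?seq        # shorthand for the same help page
```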
Both commands below perform the same task and create the values 30, 40, 50, 60, 70. Note that you do not have to explicitly name the input parameters for the command to work.
## [1] 30 40 50 60 70
## [1] 30 40 50 60 70
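A sketch of the two equivalent calls, using the argument names from, to, and by that are documented for seq():

```r
# Named arguments
seq(from = 30, to = 70, by = 10)

# Positional arguments, matched in the order from, to, by
seq(30, 70, 10)
```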
R uses various data types. Some of the most basic types to get started are characters, numerics, integers, and logicals.
# Set my_numeric to be 9.3
my_numeric <- 9.3
# Set my_character to be "mouse"
my_character <- "mouse"
# Set my_logical to be TRUE
my_logical <- TRUE
We can update the variables we stored to have new values.
# Change my_numeric to be 10
my_numeric <- 10
# Change my_character to be "mars"
my_character <- "mars"
# Change my_logical to be FALSE
my_logical <- FALSE
The str() function is very useful in R to help you understand what data type and values are associated with variables.
## num 10
## chr "mars"
## logi FALSE
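The three lines of output above would come from calls like the following, with the variables re-created here so the sketch runs on its own:

```r
my_numeric <- 10
my_character <- "mars"
my_logical <- FALSE

str(my_numeric)     # type and value of a numeric
str(my_character)   # type and value of a character
str(my_logical)     # type and value of a logical
```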
Note that you can determine all variables you have stored in your current R session by typing:
## [1] "fahrenheit_to_celsius" "my_character" "my_logical"
## [4] "my_numeric" "myTempF" "sumTwoValues"
## [7] "t" "temp_F" "x"
## [10] "y" "z"
You can remove all variables in your current R session using the rm(list=ls()) command:
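The listing and the clean-up can be sketched together as follows. The two variables are throwaway examples, and note that rm(list = ls()) clears the whole workspace and cannot be undone.

```r
# Two throwaway variables so there is something to list (names are illustrative)
a <- 1
b <- "two"

ls()              # names of all objects in the current workspace
rm(list = ls())   # remove every object returned by ls()
ls()              # now character(0)
```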
R has several data structures. These include atomic vectors, lists, matrices, factors, and data frames.
Vectors are collections of elements that are usually of mode character, logical, integer or numeric. We can create an empty vector using vector(). (By default the mode is logical.)
## logical(0)
## [1] "" "" ""
## [1] "" "" ""
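The calls behind the three results above are not shown. The following sketch is one assumed way to produce them:

```r
vector()                          # empty vector, logical(0) by default
vector("character", length = 3)   # character vector of three empty strings
character(3)                      # shorthand for the same thing
```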
You can also generate vectors by directly specifying their contents. R will then infer the appropriate mode of storage for the vector.
In addition to str(), you can also examine vectors using typeof(), length(), and class().
## chr [1:3] "Nailil" "Quang" "Kaness"
## [1] "character"
## [1] 3
## [1] "character"
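A sketch of creating a vector from its contents and inspecting it. The values come from the output above, while the variable name labmate_names is an assumption.

```r
# R infers the character mode from the contents
labmate_names <- c("Nailil", "Quang", "Kaness")

str(labmate_names)      # compact summary: type, length, values
typeof(labmate_names)   # storage mode
length(labmate_names)   # number of elements
class(labmate_names)    # class
```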
You can add elements to a vector using the combine (c()) function.
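For instance, with a small illustrative vector (the values are placeholders, not from the original example):

```r
v <- c("a", "b")
v <- c(v, "c")    # append an element at the end
v <- c("z", v)    # prepend an element at the start
v
```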
R allows missing data in vectors. Missing data are represented as NA (Not Available). The function is.na() informs which elements of vectors are missing data, and the function anyNA() returns TRUE if the vector contains at least one missing value.
## [1] FALSE TRUE FALSE FALSE TRUE
## [1] FALSE FALSE FALSE FALSE FALSE
## [1] TRUE
## [1] FALSE
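The vectors used above are not shown. The sketch below uses two assumed vectors whose missing values sit in the positions the output implies.

```r
# Values are illustrative; only the NA positions mirror the output above
with_na <- c(0.5, NA, 0.7, 0.2, NA)
without_na <- c(1, 2, 3, 4, 5)

is.na(with_na)       # TRUE where an element is missing
is.na(without_na)
anyNA(with_na)       # TRUE if at least one element is missing
anyNA(without_na)
```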
You can mix different types within a vector. In that case, R will create a vector with the mode that best accommodates all the elements it contains. Conversion between modes of storage is known as “coercion”. For example, guess what storage mode each of the following mixed-type input vectors ends up with.
See if your guess is correct!
## chr [1:2] "3.3" "p"
## num [1:2] 1 5
## chr [1:2] "q" "TRUE"
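The input vectors are not shown. The inputs below are inferred from the results above and are assumptions; in particular, the second one could have been written in other ways.

```r
str(c(3.3, "p"))    # numeric + character -> character
str(c(TRUE, 5))     # logical + numeric   -> numeric (TRUE becomes 1)
str(c("q", TRUE))   # character + logical -> character
```

The general pattern is that R coerces toward the more flexible type: logical, then numeric, then character.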
In R, matrices are an extension of numeric or character vectors. They are simply vectors with dimensions (the number of rows and columns). As with vectors, the elements of a matrix must be of the same data type.
## [,1] [,2]
## [1,] NA NA
## [2,] NA NA
We can investigate our matrix using various attribute functions (like dim(), class(), and typeof()).
## [1] 2 2
## [1] "matrix" "array"
## [1] "logical"
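A sketch that would give the empty matrix and the three attribute results above. The variable name m is an assumption.

```r
# A 2 x 2 matrix with no data supplied is filled with NA
m <- matrix(nrow = 2, ncol = 2)
m

dim(m)      # number of rows and columns
class(m)    # "matrix" "array" in R 4.0.0 and later
typeof(m)   # NA is logical by default
```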
We can fill this matrix with values; by default, R fills matrices column-wise.
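To see the fill order, here is a small sketch with illustrative values:

```r
# Values run down the first column, then the second, and so on
m <- matrix(1:6, nrow = 2, ncol = 3)
m

# byrow = TRUE fills across rows instead
m_by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_by_row
```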
Another way to fill matrices is to bind rows or columns using rbind() and cbind() (“row bind” and “column bind”, respectively).
## x y
## [1,] 4 10
## [2,] 5 11
## [3,] 6 12
Elements of a matrix can be obtained by specifying the indices along each dimension (e.g. “row” and “column”) in single square brackets.
## x
## 6
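The binding and indexing steps above can be sketched as follows. The vectors x and y are inferred from the printed matrix, and the name m is an assumption.

```r
x <- 4:6
y <- 10:12

m <- cbind(x, y)   # bind as columns: 3 rows, 2 columns
m

rbind(x, y)        # bind as rows instead: 2 rows, 3 columns

m[3, 1]            # row 3, column 1
```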
Lists act as containers in R. Unlike vectors, the elements of a list can have more than one mode and can contain any mixture of data types. Lists are sometimes referred to as “generic vectors”, because the elements of a list can be any type of R object. Lists can even contain lists as elements (“nested lists”). These properties make lists fundamentally different from vectors.
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
Elements within lists can be obtained using double square brackets.
## [1] "a"
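A sketch of the list shown above and of element access. The contents follow the printed output, and the name my_list is an assumption.

```r
my_list <- list(1, "a", TRUE)
my_list

my_list[[2]]   # double brackets return the element itself
my_list[2]     # single brackets return a list of length one
```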
List elements can be named.
## $myNum
## [1] 1
##
## $myChar
## [1] "t"
##
## $myBool
## [1] TRUE
## [1] "myNum" "myChar" "myBool"
We can obtain the element values within the named list by calling the name with the $ notation.
## [1] "t"
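The named list and the $ lookup can be sketched like this. Element names and values follow the output above, and the variable name named_list is an assumption.

```r
named_list <- list(myNum = 1, myChar = "t", myBool = TRUE)
named_list

names(named_list)   # the element names
named_list$myChar   # access one element by name
```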
Lists can be very helpful inside functions. This is because functions in R can only return a single object. Therefore, you can “concatenate” (staple) together numerous result values into a single list that the function can then return.
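As a minimal illustration, the hypothetical function below (summarise_values is not part of the original tutorial) returns three results bundled in one list:

```r
# Hypothetical helper: return several results as one named list
summarise_values <- function(values) {
  list(total = sum(values),
       average = mean(values),
       n = length(values))
}

res <- summarise_values(c(2, 4, 6))
res$total     # each result is retrieved by name
res$average
res$n
```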
A data frame is one of the most important data structures in R and is often used for tabular information in statistics. A data frame is a list in which every element has the same length (i.e. a data frame is a “rectangular” list).
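Data frames are commonly created by reading in files (e.g. with read.csv() and read.table()), can be converted to a matrix with as.matrix(), and can be built directly with the data.frame() function.
We can create a data frame below.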
## id var1 var2
## 1 a 1 11
## 2 b 2 13
## 3 c 3 15
## 4 d 4 17
## 5 e 5 19
## 6 f 6 21
## 7 g 7 23
## 8 h 8 25
## 9 i 9 27
## 10 j 10 29
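The call that built this data frame is not shown. One way to obtain the same columns and types is sketched below; the name dat and the exact expressions are assumptions.

```r
dat <- data.frame(id = letters[1:10],            # "a" to "j"
                  var1 = 1:10,                   # integer column
                  var2 = seq(11, 29, by = 2),    # numeric column
                  stringsAsFactors = FALSE)      # keep id as character in any R version
dat
```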
There are many useful data frame functions:
head() - shows top 6 rows
tail() - shows bottom 6 rows
dim() - returns dimensions of data frame (number of rows and columns)
nrow() - number of rows
ncol() - number of columns
str() - structure of data frame - name, type and preview of data in each column
names() or colnames() - both show the names attribute for a data frame
sapply(dataframe, class) - shows the class of each column in the data frame
## id var1 var2
## 1 a 1 11
## 2 b 2 13
## 3 c 3 15
## 4 d 4 17
## 5 e 5 19
## 6 f 6 21
## id var1 var2
## 5 e 5 19
## 6 f 6 21
## 7 g 7 23
## 8 h 8 25
## 9 i 9 27
## 10 j 10 29
## [1] 10 3
## [1] 10
## [1] 3
## 'data.frame': 10 obs. of 3 variables:
## $ id : chr "a" "b" "c" "d" ...
## $ var1: int 1 2 3 4 5 6 7 8 9 10
## $ var2: num 11 13 15 17 19 21 23 25 27 29
## [1] "id" "var1" "var2"
## id var1 var2
## "character" "integer" "numeric"
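The output above corresponds to calls along these lines. The data frame is rebuilt here under the assumed name dat so the sketch runs on its own.

```r
dat <- data.frame(id = letters[1:10], var1 = 1:10,
                  var2 = seq(11, 29, by = 2), stringsAsFactors = FALSE)

head(dat)            # first 6 rows
tail(dat)            # last 6 rows
dim(dat)             # rows and columns
nrow(dat)
ncol(dat)
str(dat)             # name, type and preview of each column
names(dat)
sapply(dat, class)   # class of each column
```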
Since data frames are rectangular, elements of data frames can be accessed by specifying the row and the column index in single square brackets.
## [1] 13
Since data frames are special forms of lists, we can obtain columns using the list notation, i.e. either double square brackets or a $.
## [1] 11 13 15 17 19 21 23 25 27 29
## [1] 11 13 15 17 19 21 23 25 27 29
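Both access styles can be sketched as follows, again rebuilding the data frame under the assumed name dat. The indices are chosen to match the results shown above.

```r
dat <- data.frame(id = letters[1:10], var1 = 1:10,
                  var2 = seq(11, 29, by = 2), stringsAsFactors = FALSE)

dat[2, 3]        # row 2, column 3 (matrix-style indexing)
dat[["var2"]]    # whole column, list-style with double brackets
dat$var2         # whole column, list-style with $
```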
If you are used to working with an Excel-like format, you can also “View” the data frame in that format as follows:
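The viewer is opened with View(), which needs an interactive session such as RStudio. The guard below is an addition so that the sketch also runs harmlessly in a script, and dat is an assumed name.

```r
dat <- data.frame(id = letters[1:10], var1 = 1:10, var2 = seq(11, 29, by = 2))

# Open the spreadsheet-style viewer only when a person is at the console
if (interactive()) {
  View(dat)
}
```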
Since statisticians often work with data frame structures, packages have been written that make data frames smoother to use and manipulate than with base R functions alone. One popular package for working with data frames is dplyr.
Scientists often store data in Excel spreadsheets. There are various R packages that can help R users access data from Excel spreadsheets (XLConnect, gdata, RODBC, RExcel, and xlsx). However, many users find it simpler to save their spreadsheets in comma-separated values files (.CSV) and then use base R functionality to read and manipulate the data.
We can read in a CSV file (Labmates.csv) that contains each of our names and spirit animals. First, make sure you are located in the directory where the file is located. You can do this using the “set working directory” command (setwd()) or by manually choosing “Set As Working Directory” in your RStudio space. Then, you should be able to run the command:
It may also be easier to run the following command, which opens a GUI that allows you to select the file location and the file.
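Both ways of reading the file can be sketched as below. Since Labmates.csv is not available here, the sketch first writes a small stand-in file to a temporary folder; its column names and layout are hypothetical.

```r
# Stand-in file so the sketch runs anywhere (column names are hypothetical)
csv_path <- file.path(tempdir(), "Labmates.csv")
writeLines(c("name,spirit_animal", "Vinh,pangolin", "Lindsay,cat"), csv_path)

# Read by path; after setwd() to the folder, read.csv("Labmates.csv") does the same
labmates <- read.csv(csv_path)
labmates

# Alternative: choose the file in a dialog (interactive sessions only)
if (interactive()) {
  labmates <- read.csv(file.choose())
}
```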
Looks like we each have a spirit animal. Vinh san is not happy. He really loves cats. Plus, he really does not want to have a pangolin as a spirit animal these days. Let’s switch the spirit animal of Vinh and Lindsay (who originally had cat). Then, let’s also add a column to indicate who attended the R crash course today.
Now, we can save/write our updated file to a desired location.
If you do not like the first column with the indices, you can save as follows:
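The swap, the new column, and the two ways of saving can be sketched as follows. The data frame is a two-row stand-in, and its column names, the attendance values, and the output file names are all hypothetical.

```r
# Two-row stand-in for the labmates data (column names are hypothetical)
labmates <- data.frame(name = c("Vinh", "Lindsay"),
                       spirit_animal = c("pangolin", "cat"),
                       stringsAsFactors = FALSE)

# Swap the spirit animals of Vinh and Lindsay
i <- which(labmates$name == "Vinh")
j <- which(labmates$name == "Lindsay")
labmates$spirit_animal[c(i, j)] <- labmates$spirit_animal[c(j, i)]

# Add an attendance column (values are placeholders)
labmates$attended <- c(TRUE, TRUE)

# Save with the default row index column, then without it
with_index <- file.path(tempdir(), "Labmates_updated.csv")
no_index <- file.path(tempdir(), "Labmates_updated_noindex.csv")
write.csv(labmates, with_index)
write.csv(labmates, no_index, row.names = FALSE)
```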
Note there is also a very flexible storage format in R in which you can save any type of object (not just CSV or data frame). This can be done with the RDS file format using the saveRDS() and the readRDS() functions.
We can now read in the object and, if desired, set it to a new variable attendanceRDS.
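A minimal round trip through an RDS file might look like this. The object and the file location are placeholders, and only the name attendanceRDS comes from the text.

```r
# Placeholder object to save
attendance <- data.frame(name = c("Vinh", "Lindsay"), attended = c(TRUE, TRUE))

rds_path <- file.path(tempdir(), "attendance.rds")
saveRDS(attendance, rds_path)        # write any single R object to disk

attendanceRDS <- readRDS(rds_path)   # read it back under a new name
identical(attendance, attendanceRDS)
```

Unlike a CSV, the RDS file preserves column types and other attributes exactly.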
There are numerous built-in datasets in R that you can use to practice and/or to create “minimal working examples” when you cannot use your real data. You can see a list of example R datasets by typing:
One example dataset is mtcars. We can load it and examine it.
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
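Listing the datasets and loading mtcars can be sketched as follows. Typing data() on its own shows the full list; here the result is stored and only the first few entries are printed, which is a choice made for brevity.

```r
# data() with no arguments lists the available example datasets
available <- data()
head(available$results[, c("Item", "Title")])

# Load mtcars into the workspace and inspect its structure
data(mtcars)
str(mtcars)
```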
There are numerous basic plotting types in R that do not require any extra packages. For each plot, there is often a more powerful equivalent in ggplot2, which we first need to install. The ggplot2 package is on CRAN. For packages hosted on CRAN, we can install them using the install.packages() function. We can then load the package into our current R session using the library() function.
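A sketch of the install-then-load step. The requireNamespace() check is an addition that skips the download when ggplot2 is already installed; the installation itself needs an internet connection.

```r
# Install ggplot2 from CRAN only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package into the current R session
library(ggplot2)
```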
In base R, a basic plot can be created using:
The same plot can also be created in ggplot2 using qplot():
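The plotting calls are not shown, and neither is the type of plot. As an assumption, the sketch below draws a scatter plot of two mtcars columns in both systems. Recent ggplot2 releases mark qplot() as deprecated in favour of ggplot(), so it may emit a warning.

```r
# Base R scatter plot (plot type and columns are assumptions)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# ggplot2 version, run only if the package is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::qplot(wt, mpg, data = mtcars)
  print(p)
}
```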
We can create a line plot based on the pressure dataset in R.
We can instead pass the argument type = "s" to produce a stepped line chart:
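Both line plots can be sketched with the built-in pressure dataset, which has the columns temperature and pressure:

```r
# Line plot: type = "l" connects the points with straight lines
plot(pressure$temperature, pressure$pressure, type = "l")

# Stepped line chart: type = "s" draws stair steps instead
plot(pressure$temperature, pressure$pressure, type = "s")
```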
We can use qplot() to get similar results by using the geom argument. In graphics, geoms are geometric objects (lines, points, etc.) that visually represent the data. In this case, we can represent the data using a line and then also points:
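A sketch of the geom argument, guarded so it only runs when ggplot2 is installed. The last call is an assumed ggplot() equivalent, shown because recent ggplot2 releases mark qplot() as deprecated.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Line only
  print(qplot(temperature, pressure, data = pressure, geom = "line"))

  # Line plus points
  print(qplot(temperature, pressure, data = pressure, geom = c("line", "point")))

  # Equivalent plot built with ggplot() and explicit geom layers
  print(ggplot(pressure, aes(x = temperature, y = pressure)) +
          geom_line() +
          geom_point())
}
```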
There are countless other plots that can be made in base R and ggplot2, including box plots, bar plots, histograms, stem and leaf plots, and mosaic plots. There are great online resources to practice making plots and entire textbooks. As shown earlier in this tutorial, there are also cheat sheets available in RStudio. With ggplot2 graphics, you can also use additional packages to render them interactive fairly easily. You can see neat examples of this from the ggplotly() function here.
There are various ways to download R packages from online resources. Three common sources are CRAN, GitHub, and Bioconductor. As an example, the ggplot2 package is available on CRAN here. Hence, we could install the CRAN version using the code we already saw:
You can also install the development version from GitHub (if there is one). The ggplot2 package has a GitHub repository here. Installing a repository from GitHub into R can be done with the devtools package, which is on CRAN. Here, we first install the devtools package, load it into our R session, and then use one of its functions (install_github()) to install the ggplot2 version on GitHub.
# Install the devtools package from CRAN
install.packages("devtools")
# Load devtools into the current R session
library(devtools)
# Install the development version of ggplot2 from GitHub
devtools::install_github("tidyverse/ggplot2")
Some packages, especially ones that relate to bioinformatics, are on Bioconductor. For example, the RNA-seq analysis packages DESeq2 and edgeR are on Bioconductor. If you try to install these packages using the CRAN functionality, you will get an error (package ‘edgeR’ is not available (for R version 4.0.2)).
Instead, to install a Bioconductor package into R, you will need the following type of code:
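The code is not shown here. The sketch below follows the usual Bioconductor route through the BiocManager package, which is itself installed from CRAN. It needs an internet connection, and edgeR is used as the example because it is named above.

```r
# Install the BiocManager helper from CRAN if it is missing
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install a Bioconductor package through BiocManager
BiocManager::install("edgeR")
```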
Sometimes you cannot troubleshoot an error in R, even after thinking carefully and reading about potential underlying causes. One thing you may want to do in that case is ask a colleague or post on StackOverflow. Sometimes you cannot show your real data (due to privacy issues). Moreover, on StackOverflow, you cannot upload any data. Hence, you often need to create a “minimal working example”: that is, a simulated dataset that has the same data types and formats as your real data and that can be used to reproduce the error. Let’s look at an example of how to do this.
Say, you are working with a sensitive dataset. Let’s read it in first.
We can see this dataset contains patient names, sound gene count information, and a phenotype that could be sensitive (cancer status, mental illness, etc).
## 'data.frame': 100 obs. of 3 variables:
## $ patient : chr "Patient1" "Patient2" "Patient3" "Patient4" ...
## $ geneCount: Factor w/ 62 levels "5","6","7","8",..: 5 11 54 12 34 48 23 59 30 55 ...
## $ phenotype: chr "Yes" "No" "No" "Yes" ...
Say you wanted to sum up all the gene counts in this dataset. Usually, this can be achieved easily by applying the sum() function to the corresponding column. However, when we try to do this, we receive an error (“‘sum’ not meaningful for factors”):
If you are unable to figure out this error and want to consult colleagues or StackOverflow, you will need to provide code to them that creates a minimal working example data frame (mweDF) that has the same data types as your sensitive data frame (senDF).
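# Simulate 100 patient identifiers
patient = paste0("Patient", 1:100)
# Simulate gene counts stored as a factor, as in the sensitive data
geneCount = as.factor(sample(1:100, 100, replace=TRUE))
# Simulate a yes/no phenotype
phenotype = sample(c("Yes","No"), 100, replace = TRUE)
# Combine the columns into the minimal working example data frame
mweDF = data.frame(patient = patient, geneCount = geneCount, phenotype = phenotype)
str(mweDF)
## 'data.frame': 100 obs. of 3 variables: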
## $ patient : chr "Patient1" "Patient2" "Patient3" "Patient4" ...
## $ geneCount: Factor w/ 61 levels "2","3","5","6",..: 3 13 7 15 34 54 47 27 35 1 ...
## $ phenotype: chr "Yes" "Yes" "No" "Yes" ...
Then, your colleagues can reproduce your error by running the sum() function on your simulated data frame.
If they have more experience, they can then provide a suggestion. In this case, the suggestion is to convert your geneCount column to numeric type.
Indeed, we now no longer get that error and can successfully sum that column.
## [1] 3168
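The error and the fix can be sketched together as follows. The conversion used in the original is not shown, so going through as.character() is an assumption. It matters because as.numeric() applied directly to a factor returns the internal level codes rather than the original counts. The seed is hypothetical and only makes the sketch repeatable.

```r
set.seed(1)  # hypothetical seed, only for repeatability

# Simulated counts stored as a factor, mimicking the problem column
counts <- sample(1:100, 100, replace = TRUE)
mweDF <- data.frame(geneCount = as.factor(counts))

# sum() on a factor fails; capture the message instead of stopping
msg <- tryCatch(sum(mweDF$geneCount), error = function(e) conditionMessage(e))
msg

# Convert via character so the original counts are recovered
mweDF$geneCount <- as.numeric(as.character(mweDF$geneCount))
sum(mweDF$geneCount)
```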